MARKET BASKET ANALYSIS


Alparslan Erol

09/02/2021

Introduction

The goal of this project is to analyze the market baskets of consumers. By analyzing baskets, insights into consumer purchasing behavior can be obtained. To achieve our goal, a dataset was downloaded from Kaggle. The dataset has 38,765 rows of purchase orders placed by people at grocery stores. These orders can be analyzed and association rules can be generated using Market Basket Analysis with algorithms such as the Apriori algorithm. Time series analysis could also be conducted, but it is not the focus of this research.

In [1]:
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt
%matplotlib inline
import plotly_express as px
from apyori import apriori
import networkx as nx
from fa2 import ForceAtlas2
import random
import itertools
In [2]:
df = pd.read_csv("Groceries_dataset.csv")
df.head()
Out[2]:
Member_number Date itemDescription
0 1808 21-07-2015 tropical fruit
1 2552 05-01-2015 whole milk
2 2300 19-09-2015 pip fruit
3 1187 12-12-2015 other vegetables
4 3037 01-02-2015 whole milk

Wrangling Dataframe


Casting the attributes to appropriate types and printing the resulting dtypes:

In [3]:
df["Member_number"] = df["Member_number"].apply(str)
df["itemDescription"] = df["itemDescription"].apply(str)
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)  # dates are in dd-mm-yyyy format
print(df.dtypes)
Member_number              object
Date               datetime64[ns]
itemDescription            object
dtype: object
In [4]:
df.sort_values(by=["Member_number", "Date"], inplace=True)
df.reset_index(drop=True, inplace=True)
df["value"] = 1
df.head()
Out[4]:
Member_number Date itemDescription value
0 1000 2014-06-24 whole milk 1
1 1000 2014-06-24 pastry 1
2 1000 2014-06-24 salty snack 1
3 1000 2015-03-15 sausage 1
4 1000 2015-03-15 whole milk 1

Histogram

From the histogram below, we can identify the most and least frequently purchased items. Since the number of items is large, the bars are packed tightly; because it is a Plotly plot, you can hover over a bar to examine the number of occurrences of each item. As we can see, whole milk, other vegetables, and rolls/buns are the three most frequently purchased items.

In [5]:
fig_hist = px.histogram(df, "itemDescription", color_discrete_sequence=px.colors.diverging.Spectral,\
                       title="Histogram for Market Basket Analysis", labels={"itemDescription":"Items in Basket"})
fig_hist.update_layout(yaxis_title_text="Number of Occurrences (Count)")
fig_hist.update_xaxes(tickangle=45, categoryorder="total descending")
fig_hist.show()

Preparing pivot table for association rule analysis

In [6]:
df_pivot = df.pivot_table(index=["Member_number", "Date"], columns=["itemDescription"], values=["value"], fill_value=0)
column_names = []
for i, j in df_pivot.columns:
    column_names.append(j)
df_pivot.columns = column_names
df_pivot.head()
Out[6]:
Instant food products UHT-milk abrasive cleaner artif. sweetener baby cosmetics bags baking powder bathroom cleaner beef berries ... turkey vinegar waffles whipped/sour cream whisky white bread white wine whole milk yogurt zwieback
Member_number Date
1000 2014-06-24 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2015-03-15 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 1 0
2015-05-27 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2015-07-24 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2015-11-25 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 167 columns

Creating a separate dataframe with the number of occurrences of each item.

In [7]:
df_count = pd.DataFrame(df.groupby(by=["itemDescription"])["value"].sum().reset_index())\
                    .rename(columns={"value":"count"}).sort_values(by=["count"], ascending=[False]).reset_index(drop=True)
df_count.head()
Out[7]:
itemDescription count
0 whole milk 2502
1 other vegetables 1898
2 rolls/buns 1716
3 soda 1514
4 yogurt 1334

Word Cloud

A word cloud is a visual representation of word frequency: the more often a term appears in the text being analyzed, the larger it appears in the generated image. Word clouds are a simple tool for identifying the focus of written material. `df_count` holds the item frequencies for the word cloud. As you can see below, whole milk, other vegetables, rolls/buns, and soda are again printed larger than frankfurter, curd, etc.

In [8]:
wc = WordCloud(max_words=3000, background_color="white")
freq_dict = df_count.set_index('itemDescription').T.to_dict('records')
cloud = wc.generate_from_frequencies(freq_dict[0])
plt.figure(figsize=(15, 25))
plt.imshow(cloud, interpolation='bilinear')
#cloud.to_file('word_cloud.png')
plt.title("Word Cloud for Market Basket Analysis", fontsize=18)
plt.axis("off")
plt.show()


Association Rule Learning


Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness.[1] In the domain of market basket research, the main goal is to find relations between the purchased items. For example, consider the following rule:
$${\displaystyle \{\mathrm {loaf, yogurt} \}\Rightarrow \{\mathrm {milk} \}}$$
The rule above indicates that consumers who purchase loaf and yogurt together will most likely purchase milk as well. This kind of information can be used in sales strategies such as promotion, pricing, bundling, etc.


Concepts for Understanding[2]


Support:


Support is an indication of how frequently the itemset appears in the dataset.

The support of X with respect to T is defined as the proportion of transactions t in the dataset T which contain the itemset X.

$${\displaystyle \mathrm {supp} (X)={\frac {|\{t\in T;X\subseteq t\}|}{|T|}}}$$

Say we have 4 transactions in total and exactly one of them, {loaf, yogurt, milk}, contains both loaf and yogurt. Then the support of {loaf, yogurt} is 1/4 = 25%, and the support of {loaf, yogurt, milk} is also 25%, since the same single transaction contains all three items.
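The support formula can be checked with a few lines of Python. The four transactions below are purely illustrative (not drawn from the Groceries dataset); only one of them contains both loaf and yogurt:

```python
# Four illustrative transactions (toy data, not from the Groceries dataset)
transactions = [
    {"loaf", "yogurt", "milk"},
    {"milk", "bread"},
    {"soda"},
    {"beer", "chips"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"loaf", "yogurt"}, transactions))          # 0.25
print(support({"loaf", "yogurt", "milk"}, transactions))  # 0.25
```

Both itemsets appear in exactly one of the four transactions, so both supports come out to 25%.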


Confidence:


Confidence is an indication of how often the rule has been found to be true.

The confidence value of a rule, X -> Y, with respect to a set of transactions T, is the proportion of the transactions that contain X which also contain Y.

Confidence is defined as:

$${\mathrm{conf}}(X\Rightarrow Y)={\mathrm {supp}}(X\cup Y)/{\mathrm {supp}}(X)$$


Continuing the example from the support section, where {loaf, yogurt, milk} is the only transaction containing both loaf and yogurt:


$${\displaystyle \mathrm{conf}(\{\mathrm {loaf, yogurt} \}\Rightarrow \{\mathrm {milk} \})={\frac {0.25}{0.25}}=1.0}$$

Every transaction that contains {loaf, yogurt} also contains milk, so the confidence of the rule is 1.0.
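The confidence calculation can be sketched on the same illustrative transactions used for support (toy data, not from the Groceries dataset):

```python
# Same illustrative transactions as in the support example (toy data)
transactions = [
    {"loaf", "yogurt", "milk"},
    {"milk", "bread"},
    {"soda"},
    {"beer", "chips"},
]

def support(itemset, transactions):
    """Proportion of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """conf(X => Y) = supp(X ∪ Y) / supp(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

print(confidence({"loaf", "yogurt"}, {"milk"}, transactions))  # 1.0
```

Since the only transaction containing {loaf, yogurt} also contains milk, supp(X ∪ Y) equals supp(X) and the confidence is 1.0.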


References


  1. Piatetsky-Shapiro, Gregory (1991), "Discovery, analysis, and presentation of strong rules", in Piatetsky-Shapiro, Gregory; Frawley, William J. (eds.), Knowledge Discovery in Databases, AAAI/MIT Press, Cambridge, MA.
  2. https://en.wikipedia.org/wiki/Association_rule_learning#cite_note-piatetsky-1

Preparing a separate dataframe for association rule learning.

In [9]:
analysis_df = df.groupby(['Member_number','Date'])['itemDescription'].apply(','.join).reset_index()
analysis_df["itemDescription"] = analysis_df["itemDescription"].apply(lambda row: row.split(","))
analysis_list = list(analysis_df.itemDescription)
print(analysis_list[0:5])
analysis_df.head()
[['whole milk', 'pastry', 'salty snack'], ['sausage', 'whole milk', 'semi-finished bread', 'yogurt'], ['soda', 'pickled vegetables'], ['canned beer', 'misc. beverages'], ['sausage', 'hygiene articles']]
Out[9]:
Member_number Date itemDescription
0 1000 2014-06-24 [whole milk, pastry, salty snack]
1 1000 2015-03-15 [sausage, whole milk, semi-finished bread, yog...
2 1000 2015-05-27 [soda, pickled vegetables]
3 1000 2015-07-24 [canned beer, misc. beverages]
4 1000 2015-11-25 [sausage, hygiene articles]

Defining the run_apriori_algorithm method for running the Apriori algorithm.

In [10]:
def run_apriori_algorithm(list_for_items, min_support, min_confidence):
    # Use the function's own parameters (the original relied on the
    # globals min_sup/min_conf, ignoring the values passed in)
    rules = apriori(list_for_items, min_support=min_support, min_confidence=min_confidence)
    frules = []
    for r in rules:
        for o in r.ordered_statistics:
            conf = o.confidence
            supp = r.support
            x = list(o.items_base)
            y = list(o.items_add)
            #print("{%s} -> {%s}  (supp: %.3f, conf: %.3f)" % (x, y, supp, conf))
            frules.append((x, y, supp, conf))
    cols = ["{X} ->", "{Y}", "Support (>%s)" % min_support, "Confidence (>%s)" % min_confidence]
    result_df = pd.DataFrame(frules, columns=cols).sort_values(by="Confidence (>%s)" % min_confidence, ascending=False).reset_index(drop=True)
    return result_df

min_sup = 0.01
min_conf = 0.1

resulting_df = run_apriori_algorithm(analysis_list, min_sup, min_conf)
In [11]:
resulting_df
Out[11]:
{X} -> {Y} Support (>0.01) Confidence (>0.1)
0 [] [whole milk] 0.157923 0.157923
1 [yogurt] [whole milk] 0.011161 0.129961
2 [rolls/buns] [whole milk] 0.013968 0.126974
3 [] [other vegetables] 0.122101 0.122101
4 [other vegetables] [whole milk] 0.014837 0.121511
5 [soda] [whole milk] 0.011629 0.119752
6 [] [rolls/buns] 0.110005 0.110005

Comments for Association Rule Learning


  • The resulting dataframe from the Apriori algorithm is printed above. I set thresholds for support and confidence because I wanted to highlight the important relations. As expected from the previous graphs, when X is empty, Y contains whole milk, other vegetables, or rolls/buns. This makes sense since these are the most frequently purchased items: even with an empty basket, consumers will most probably purchase them. Therefore, I think heavy promotions on these items are unnecessary, because they are the most basic and important items to purchase; this is supported mathematically by their support values, which are significantly greater than those of the other relations. The other associations, such as yogurt with whole milk or soda with whole milk, are also logical to expect, since they again involve the most frequently purchased items.
  • These associations are very useful for suggesting potential new cases, but in our case we could not find a different item to associate yet.
  • To observe associations that would be a surprise, I will rerun the algorithm with smaller support and confidence thresholds.
In [12]:
min_sup = 0.001
min_conf = 0.01

resulting_df = run_apriori_algorithm(analysis_list, min_sup, min_conf)
resulting_df.head(30)
Out[12]:
{X} -> {Y} Support (>0.001) Confidence (>0.01)
0 [yogurt, sausage] [whole milk] 0.001470 0.255814
1 [rolls/buns, sausage] [whole milk] 0.001136 0.212500
2 [soda, sausage] [whole milk] 0.001069 0.179775
3 [semi-finished bread] [whole milk] 0.001671 0.176056
4 [rolls/buns, yogurt] [whole milk] 0.001337 0.170940
5 [whole milk, sausage] [yogurt] 0.001470 0.164179
6 [detergent] [whole milk] 0.001403 0.162791
7 [ham] [whole milk] 0.002740 0.160156
8 [] [whole milk] 0.157923 0.157923
9 [bottled beer] [whole milk] 0.007151 0.157817
10 [frozen fish] [whole milk] 0.001069 0.156863
11 [candy] [whole milk] 0.002139 0.148837
12 [sausage] [whole milk] 0.008955 0.148394
13 [onions] [whole milk] 0.002941 0.145215
14 [processed cheese] [whole milk] 0.001470 0.144737
15 [processed cheese] [rolls/buns] 0.001470 0.144737
16 [newspapers] [whole milk] 0.005614 0.144330
17 [domestic eggs] [whole milk] 0.005280 0.142342
18 [packaged fruit/vegetables] [rolls/buns] 0.001203 0.141732
19 [seasonal products] [rolls/buns] 0.001002 0.141509
20 [cat food] [whole milk] 0.001671 0.141243
21 [waffles] [whole milk] 0.002606 0.140794
22 [hamburger meat] [whole milk] 0.003074 0.140673
23 [rolls/buns, soda] [other vegetables] 0.001136 0.140496
24 [other vegetables, yogurt] [whole milk] 0.001136 0.140496
25 [frankfurter] [whole milk] 0.005280 0.139823
26 [sugar] [whole milk] 0.002473 0.139623
27 [chewing gum] [whole milk] 0.001671 0.138889
28 [beef] [whole milk] 0.004678 0.137795
29 [flour] [whole milk] 0.001337 0.136986
In [13]:
resulting_df.shape
Out[13]:
(1269, 4)

Comments for Association Rule Learning


Again, the suggestions seem to follow the most frequently purchased items, because the Y side almost always contains the top items. On the other hand, we now have the chance to examine the X items more closely. The sausage item appears more frequently now, so it can be considered for further sales strategies.
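One way to surface the less obvious rules is to filter out rules whose consequent is simply a top seller. A minimal sketch, using a hypothetical miniature of `resulting_df` (in the notebook, the real `resulting_df` from run_apriori_algorithm would be used instead):

```python
import pandas as pd

# Hypothetical miniature of resulting_df (real rules come from run_apriori_algorithm)
rules = pd.DataFrame({
    "{X} ->": [["yogurt"], ["processed cheese"], ["dental care"]],
    "{Y}":    [["whole milk"], ["rolls/buns"], ["candy"]],
})

top_items = {"whole milk", "other vegetables", "rolls/buns", "soda", "yogurt"}

# Keep only rules whose consequent contains no top-selling item, so the
# remaining associations are not explained by overall popularity alone
mask = rules["{Y}"].apply(lambda ys: not set(ys) & top_items)
surprising = rules[mask].reset_index(drop=True)
print(surprising)
```

Only the dental care -> candy row survives the filter here; applied to the full `resulting_df`, this would isolate the "surprise" associations discussed above.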

Graph & Network Analysis for Association Rule Learning


To further analyze and better visualize the association between the market basket items, I will create graphs with networkx package and optimize the node positions via forceatlas2 package.

In [14]:
G = nx.Graph()
item_nodes = list(df_count["itemDescription"].unique())
other_nodes = list(analysis_df["itemDescription"])
G.add_nodes_from(item_nodes)
In [15]:
for transaction in other_nodes:
    for comb_items in list(itertools.combinations(transaction, 2)):
        if G.has_edge(comb_items[0], comb_items[1]):
            G[comb_items[0]][comb_items[1]]["weight"] += 1
        else:
            G.add_edge(comb_items[0], comb_items[1], weight = 1)
In [16]:
# Plot network with force atlas 2
forceatlas2 = ForceAtlas2(# Behavior alternatives
                        outboundAttractionDistribution=True, 
                        linLogMode=False,  
                        adjustSizes=False,  
                        edgeWeightInfluence=1.0,

                        # Performance
                        jitterTolerance=1.0,  
                        barnesHutOptimize=True,
                        barnesHutTheta=0.5,
                        multiThreaded=False,  

                        # Tuning
                        scalingRatio=4.0,
                        strongGravityMode=False,
                        gravity=1.0,

                        # Log
                        verbose=True)
positions = forceatlas2.forceatlas2_networkx_layout(G, pos=None, iterations=5000) # undir_G
100%|██████████| 5000/5000 [00:48<00:00, 103.07it/s]
BarnesHut Approximation  took  4.09  seconds
Repulsion forces  took  39.32  seconds
Gravitational forces  took  0.31  seconds
Attraction forces  took  1.75  seconds
AdjustSpeedAndApplyForces step  took  1.05  seconds

In [17]:
sizes_gm = []
colors_gm = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)]) for i in range(G.number_of_nodes())]
for i in G.nodes():
    sizes_gm.append(G.degree(i))
In [18]:
nx.draw_networkx_nodes(G, positions, node_size=sizes_gm, node_color=colors_gm, alpha=0.8)  # labels omitted; the graph is too dense
nx.draw_networkx_edges(G, positions, alpha=0.25)
plt.rcParams['figure.figsize'] = [20,20]
plt.axis('off')
plt.title('GRAPH OF MARKET BASKET ITEMS', fontsize=20)
plt.draw();

Graph & Network Analysis for Association Rule Learning


Since the graph representation above is very crowded, I will create a new, less crowded graph. The total number of transactions is 14963; I will select only 300 of them at random to get a clearer graph.

In [19]:
sampled_other_nodes = random.sample(list(analysis_df["itemDescription"]), k=300)  # sample without replacement to avoid duplicate transactions
most_freq_items = ["whole milk", "other vegetables", "rolls/buns"]
print(len(most_freq_items))
print(len(sampled_other_nodes))
print("Printing samples from Sampled Transactions:\n", random.choices(sampled_other_nodes, k=3))
3
300
Printing samples from Sampled Transactions:
 [['detergent', 'pot plants'], ['ice cream', 'curd'], ['other vegetables', 'soda', 'rolls/buns', 'dishes']]
In [20]:
simple_G = nx.Graph()
sample_item_nodes = list(set(list(pd.core.common.flatten(sampled_other_nodes)))) 
simple_G.add_nodes_from(sample_item_nodes)
In [21]:
for transaction in sampled_other_nodes:
    for comb_items in list(itertools.combinations(transaction, 2)):
        if simple_G.has_edge(comb_items[0], comb_items[1]):
            simple_G[comb_items[0]][comb_items[1]]["weight"] += 1
        else:
            simple_G.add_edge(comb_items[0], comb_items[1], weight = 1)
In [22]:
forceatlas2 = ForceAtlas2(# Behavior alternatives
                        outboundAttractionDistribution=True, 
                        linLogMode=False,  
                        adjustSizes=False, 
                        edgeWeightInfluence=1.0,
                        # Performance
                        jitterTolerance=1.0, 
                        barnesHutOptimize=True,
                        barnesHutTheta=0.5,
                        multiThreaded=False, 
                        # Tuning
                        scalingRatio=4.0,
                        strongGravityMode=False,
                        gravity=1.0,
                        # Log
                        verbose=True)

undir_G = simple_G.to_undirected()
positions = forceatlas2.forceatlas2_networkx_layout(undir_G, pos=None, iterations=5000) # undir_G
100%|██████████| 5000/5000 [00:24<00:00, 203.31it/s]
BarnesHut Approximation  took  2.19  seconds
Repulsion forces  took  20.02  seconds
Gravitational forces  took  0.21  seconds
Attraction forces  took  0.27  seconds
AdjustSpeedAndApplyForces step  took  0.81  seconds

In [23]:
sizes = []
colors = ["#"+''.join([random.choice('0123456789ABCDEF') for j in range(6)]) for i in range(simple_G.number_of_nodes())]
for i in simple_G.nodes():
    sizes.append(simple_G.degree(i))
sizes = [i*5 for i in sizes]
In [24]:
nx.draw_networkx_nodes(G=simple_G, pos=positions, node_size=sizes, node_color=colors, alpha=0.8)  # labels are drawn separately below
nx.draw_networkx_edges(G=simple_G, pos=positions, alpha=0.1)
nx.draw_networkx_labels(G=simple_G, pos=positions, font_size=10)
plt.rcParams['figure.figsize'] = [20,20]
plt.axis('off')
plt.title('NETWORK OF MARKET BASKET ITEMS', fontsize=20)
plt.draw();

Graph & Network Analysis for Association Rule Learning


  • As we can see from the graph above, there are hubs in the network, which are again the most frequently purchased items. The node radii are greater for these hubs. Items such as whole milk, yogurt, soda, and root vegetables are very densely connected with other items.
  • It is interesting and meaningful to see that dental care is connected with candy and salty snacks.
  • Network analysis can be used to spot outliers such as dog food, if we treat the 300 selected transactions as our population rather than a sample.
  • Network analysis could also be conducted on the resulting dataframe of the Apriori algorithm, but again most of the associations are based on the most frequently purchased items.
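The hub observation can be quantified by ranking nodes by weighted degree, i.e. the total co-purchase count of each item. A minimal sketch on a toy graph (in the notebook, `G` or `simple_G` built above would be used instead, and the item names here are only illustrative):

```python
import networkx as nx

# Toy basket co-purchase graph; edge weight = number of shared transactions
g = nx.Graph()
g.add_edge("whole milk", "yogurt", weight=5)
g.add_edge("whole milk", "soda", weight=3)
g.add_edge("yogurt", "soda", weight=1)
g.add_edge("dog food", "whole milk", weight=1)

# Weighted degree sums each node's edge weights; hubs rank first,
# outliers such as dog food fall to the bottom
hubs = sorted(g.degree(weight="weight"), key=lambda kv: kv[1], reverse=True)
print(hubs)  # whole milk ranks first with weighted degree 9
```

The same one-liner on `simple_G` would confirm that the visually largest nodes are indeed the highest-degree hubs.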